fix(benchmark): assign unique color per model in scatter plots#231
Merged
Conversation
Three benchmark scatter renderers (Pareto, precision/recall, token efficiency) drew from tab20 with `i % 20`, so any run with more than 20 models silently reused colors. With 28 models in the current report the legend now had multiple identical dots. Add `_distinct_colors(n)` that concatenates tab20 + tab20b + tab20c (60 categorical colors) and falls back to evenly-spaced hsv past that. All three renderers now call it and index `colors[i]`. The three regenerated SVGs in this commit reflect the fix. Benchmark-only; not shipped in the runtime image; no version bump.
Re-renders report.md plus every SVG in results/report_assets/ from the current calls.jsonl + corpus, so the published artifacts match the rest of the repo's snapshot pattern. The substantive change is still only the three scatter plots (pareto, precision_recall, token_efficiency) where each model now gets a unique color; the other SVGs and report.md only differ in matplotlib-generated element IDs, embedded timestamps, and PASS-row ordering for ties.
ttlequals0
added a commit
that referenced
this pull request
May 17, 2026
Two HIGH-severity findings from the security audit on main (#231 follow-up). 1. Rotate Flask session on /auth/login and /auth/password. Previously `session['authenticated']=True` mutated the existing signed cookie. Any pre-login token (XSS on a sibling subdomain, MITM with insecure transport) would retain the same identifier post-auth. `session.clear()` before re-setting `permanent=True` and `authenticated=True` rotates the cookie so a pre-auth token cannot ride the new authenticated state. SameSite=Strict + Secure already mitigate but do not eliminate the fixation risk. 2. Refuse to start when SECRET_KEY mint cannot be coordinated. Previously get_or_create_secret_key() would warn-and-continue when the flock fallback failed; if the DB write then also failed, two gunicorn workers could mint divergent keys and silently invalidate each other's cookies. Now: if lock_fd is None AND the DB write raises, we raise a new SecretKeyUnavailableError that fails startup. Operators can bypass by setting SECRET_KEY via env. Behavior change: deploys with no SECRET_KEY env AND a broken data dir AND a failing DB will now refuse to boot instead of starting with a divergent key. This is intentional.
ttlequals0
added a commit
that referenced
this pull request
May 17, 2026
…splits) (#232) * fix(security): rotate session on login + fail loud on missing secret key Two HIGH-severity findings from the security audit on main (#231 follow-up). 1. Rotate Flask session on /auth/login and /auth/password. Previously `session['authenticated']=True` mutated the existing signed cookie. Any pre-login token (XSS on a sibling subdomain, MITM with insecure transport) would retain the same identifier post-auth. `session.clear()` before re-setting `permanent=True` and `authenticated=True` rotates the cookie so a pre-auth token cannot ride the new authenticated state. SameSite=Strict + Secure already mitigate but do not eliminate the fixation risk. 2. Refuse to start when SECRET_KEY mint cannot be coordinated. Previously get_or_create_secret_key() would warn-and-continue when the flock fallback failed; if the DB write then also failed, two gunicorn workers could mint divergent keys and silently invalidate each other's cookies. Now: if lock_fd is None AND the DB write raises, we raise a new SecretKeyUnavailableError that fails startup. Operators can bypass by setting SECRET_KEY via env. Behavior change: deploys with no SECRET_KEY env AND a broken data dir AND a failing DB will now refuse to boot instead of starting with a divergent key. This is intentional. * chore: phase-1 audit cleanups (compose, docker, hooks, ASCII, imports, dead alias) Bundle of low-risk hygiene from the 2026-05-16 codebase audit. No behavior change to any user-visible flow. Container / infra - docker-compose.yml + docker-compose.cpu.yml: add security_opt: ["no-new-privileges:true"] and cap_drop: [ALL] to the minuspod service. App already drops to UID 1000 via setpriv; nothing needs caps after start. - Dockerfile + Dockerfile.cpu: pin pip==25.2 and setuptools==80.9.0 for reproducibility (previously --upgrade with no floor). - .githooks/pre-commit: tighten the Bearer regex post-filter to strip obvious placeholder hits (example/placeholder/test/your-/fixture/ sample/dummy/fake/redacted), document the line-oriented multi-line limitation. Also add an ASCII-glyph block that fails commits introducing em-dashes, smart quotes, or U+2605 stars. - .github/workflows/*.yml: pin every third-party action use to its commit SHA with a human-readable comment. Closes the floating-major- tag supply-chain exposure flagged in the audit. Security headers - src/main_app/__init__.py: extend the response hook so application/ json responses get Content-Security-Policy: default-src 'none'; frame-ancestors 'none'. Matches the lockdown already applied to /feed/* RSS. text/html CSP unchanged. - tests/integration/test_security_headers.py: flip test_csp_absent_on_json_responses -> test_csp_locked_down_on_json_ responses to assert the new presence + values. Imports - src/api/episodes.py: hoist audio_peaks, chapters_generator, llm_client, processing_queue out of inline function calls. main_app.processing and ad_detector stay inline (circular). - src/api/feeds.py: hoist urllib.parse.urlparse and defusedxml.ElementTree (aliased DefusedET to avoid shadowing the existing xml.etree.ElementTree alias used for OPML export). - src/audio_fingerprinter.py: replace two inline `import acoustid` calls with one module-level guarded import; methods now branch on `acoustid is None`. - src/processing_timeouts.py: investigated; hoisting still triggers the circular import path through status_service. Kept inline (audit finding was a false positive). Dead code - src/ad_detector.py: drop the INVALID_SPONSOR_REASONS class alias (kept "for old name" per its own comment); two call sites updated to use module-level INVALID_SPONSOR_VALUES directly. ASCII sweep - 66 U+2014 em-dashes and 1 U+2605 star replaced across 26 files (mostly comments, docstrings, a few <option> labels in PatternImportDialog). No smart quotes were found. Future regressions are now blocked by the pre-commit ASCII guard. Verification: 1177 pytest passed, frontend tsc clean. * chore(detection): drop dead network_id parameter from process_transcript The audit flagged a `network_id` parameter that was plumbed three layers deep but never set by any caller of AdDetector.process_transcript. Confirmed: both call sites (src/main_app/processing.py:314 and src/api/episodes.py:837) pass only podcast_id, never network_id. The forward to text_pattern_matcher.find_matches drops the kwarg; that signature keeps network_id because pattern_service.py and other matcher entry points genuinely use it. The second audit finding for this phase -- "podcast_id=slug is mis-scoped" -- turned out to be wrong. ad_patterns.podcast_id is declared TEXT (no FK to podcasts.id; the schema explicitly survives content deletion), and patterns are both written and read using the slug. The field is named misleadingly but the equality check in _filter_patterns_by_scope is consistent. No fix needed. * chore: phase-3 helper consolidation (backend utils + frontend hooks) Consolidate duplicated logic surfaced by the audit. Behavior-preserving. Backend - src/utils/ttl_cache.py: single TTLCache(ttl_seconds) class. Three modules now share it: cleanup_service (settings cache, 300s), llm_client (provider cache, preserving the existing threading.Lock and a _CACHED_NONE sentinel so None results don't re-query), sponsor_service (freshness gate; instance dicts retained because they're read across many methods). - src/utils/time.py: new overlap_ratio(start_a, end_a, start_b, end_b) and ranges_overlap(... , tolerance=0.0). ad_detector._compute_overlap and audio_analysis.base.AudioSegmentSignal.overlaps now one-line wrappers (signatures preserved for test introspection); ad_validator._overlaps_corrections drops its inline math. - src/utils/constants.py: NON_BRAND_WORDS (formerly ad_detector ._NON_BRAND_WORDS) moved to the central constants module with a docstring noting INVALID_SPONSOR_CAPTURE_WORDS is intentionally separate (capture-start filler vs. brand-disqualifier). Alias kept in ad_detector for any in-file references. - src/sponsor_service.py: gains extract_sponsor_from_reason(text) and extract_sponsors_from_transcript(text, ad_reason). The two ad_detector duplicates now delegate. extract_sponsor_names module shim retained because tests import it directly. Frontend - frontend/src/components/Artwork.tsx + ARTWORK_FALLBACK_SVG: replaces the 200-char inline data: SVG that was copied into FeedCard, FeedListItem, EpisodeDetail, FeedDetail. - frontend/src/hooks/useLocalStorageState.ts: generic typed useState+localStorage with a legacy raw-string fallback so existing on-disk values don't get re-parsed. Migrated Dashboard, CollapsibleSection, EpisodeDetail. - frontend/src/hooks/useSyncFromQuery.ts: render-phase seed hook. Settings.tsx three near-identical snapshot-vs-data blocks now collapse to three useSyncFromQuery calls. Kept render-phase rather than useEffect-based per existing project history that intentionally moved this off useEffect. - frontend/src/components/Layout.tsx: NAV_ITEMS table + NavLink subcomponent. 234 -> 156 LOC. lastPath state replaced with useEffect([location.pathname]). - frontend/src/utils/detectionStage.ts: DETECTION_STAGE_META lookup replaces the stringly-typed ternary chain in EpisodeDetail.tsx:500. - frontend/src/api/client.ts: apiFileRequest helper. Four blob downloads (settings, history, feeds-OPML, patterns) drop their hand-rolled Content-Disposition / error-shape parsing. feeds.ts importOpml is a JSON-response upload (not a blob download) so it stays on raw fetch but now uses the shared extractErrorMessage. Verification: 1177 pytest passed, frontend tsc clean. * refactor: phase-4 type tightening (PatternScope, EpisodeStatus, SettingEntry) Three audit findings: stringly-typed fields and dict-shape leaks. Behavior-preserving; the enum value's str subclass means existing literal comparisons keep working. Frontend - PatternScope = 'podcast' | 'network' | 'global' exported from frontend/src/api/patterns.ts. Four bare `string` sites on AdPattern, PatternStats, getPatterns params, and updatePattern updates now reference it. AdEditor.DetectedAd.scope also tightened. PatternCorrection.scope stays narrower ('podcast' | 'global') because user-driven create-mode only exposes those two. Backend - src/utils/constants.py: new EpisodeStatus(str, Enum) covering the seven values that actually appear in the codebase (discovered, pending, processing, processed, failed, permanently_failed, plus the API-only 'completed' alias that api/episodes.py substitutes for 'processed' in responses). Source of truth pulled from the schema CHECK constraint. - Migrated the high-traffic mutation sites: src/api/episodes.py (11 sites), src/main_app/processing.py (6 sites, leaving 2 in record_processing_history whose table uses a different vocab), src/main_app/routes.py (6 sites), src/database/episodes.py (1), src/main_app/background.py (1). Raw SQL string literals left alone -- moving those needs parameter binding, no type-safety benefit at the SQL boundary. - src/api/settings.py: new frozen SettingEntry(value, is_default) dataclass + _settings_view() wrapper. _setting_value and _setting_is_default now accept either SettingEntry or the legacy dict shape so the seven consumers of db.get_all_settings() outside this module aren't affected. Scope-guard applied: wrap at the API-helper layer rather than changing the producer. Verification: 1177 pytest passed, frontend tsc clean. * chore: phase-7+8 frontend mid-tier + remaining security follow-ups Eight items from the audit's lower-priority pile, batched. Security - src/api/auth.py: /auth/logout now manually validates the CSRF token on top of the existing AUTH_EXEMPT_PATHS exemption. Unauthenticated callers still bypass (csrf.validate returns None when session is unauthenticated), preserving the "always callable to clear stale state" property. Closes the cross-site-logout finding. - src/community_sync.py: _fetch_manifest now routes through safe_http.safe_get(trust=URLTrust.OPERATOR_CONFIGURED, stream=True) + read_response_capped(resp, 256*1024). Matches the SSRF posture already used in rss_parser. The manifest URL is a build-time constant today, but the wrapper protects any future setting that exposes it. - src/rss_parser.py: DOCTYPE pre-scan widened from 4 KB to 64 KB so a feed with extensive leading whitespace/comments still triggers the xml_forbidden_construct warning. defusedxml still gates XXE either way. - entrypoint.sh: added -xdev to both find calls and a pre-check that emits a WARN line (continues, doesn't abort) when DATA_DIR is owned by a uid that's neither 0 nor APP_UID. Warning-only first release so we see how it behaves on existing volumes before tightening to exit 1. Code consolidation - src/llm_client.py: extracted shared retry/circuit-breaker skeleton into LLMClient._send_with_fallback. The two concrete messages_create methods (Anthropic, OpenAI-compatible) now define a local _send closure and delegate the bookkeeping. ~60 LOC of duplicate try/except gone. Request kwargs still per-provider since Anthropic and OpenAI shapes diverge. - src/ad_detector.py: replaced the five lazy @Property accessors (db, audio_fingerprinter, text_pattern_matcher, pattern_service, sponsor_service) with a single _ensure_deps() called from initialize_client and the four DB-reading getters. Attributes are now plain. Removes the thread-race window on first access. Test fixtures in tests/unit/test_ad_detector* updated to match the plain-attribute surface. Frontend - frontend/src/pages/StatsPage.tsx: SortTh replaced with a makeSortTh factory closure-bound to sortField/sortDir/handleSort via useMemo. Eight call sites lose three props each. - frontend/src/pages/Settings.tsx: dropped the useEffect that was force-writing 'settings-section-system-status' to true on every load (overriding the user's collapse choice). CollapsibleSection.defaultOpen already covers the same need cleanly. Verification: 1177 pytest passed, frontend tsc clean, bash -n entrypoint.sh ok. * refactor(api): decompose update_ad_detection_settings + pattern handlers PR 5.1 + 5.2 from the audit follow-up plan. src/api/settings.py: update_ad_detection_settings dropped from a 263-line handler to 34 lines of dispatch. The audit's reference to nested _tu/_sv closures was stale -- those were already lifted to _apply_stage_tunables in a prior pass -- but the remaining sequential `if 'key' in data:` chain has now been split into nine named phase helpers that each take (db, data) and return None or an error response: _apply_prompt_fields, _apply_review_fields, _apply_model_fields, _apply_processing_flags, _apply_min_cut_confidence, _apply_provider_fields, _apply_whisper_fields, _apply_vad_gap_fields, _apply_podcast_index_fields. Handler iterates a tuple of callables; ordering and every set_setting / error string / log message is preserved verbatim. src/api/patterns.py: three large route bodies decomposed. - submit_correction: 247 -> 44 LOC. Per-action work moved to _handle_confirm_correction, _handle_reject_correction, _handle_adjust_correction, _maybe_rewrite_pattern_from_adjustment, with shared _resolve_or_create_pattern_from_text covering the dedup-or-create-from-text path (label param toggles log wording). - _submit_correction_create: 120 -> 77 LOC. Pulled out _validate_create_correction_input and _insert_manual_marker. - import_patterns: 136 -> 40 LOC. Pulled out _validate_import_request, _validate_import_items, _upsert_import_pattern, _apply_pattern_imports. Behavior preservation: - 1177 pytest passed (baseline) -> 1177 pytest passed (after). - Pattern-specific suites (84 tests) and settings-specific suites (15 tests) all pass. - Route signatures, JSON contracts, error message strings, log message strings, validation rules, and ordering all unchanged. - inline imports of PatternService inside _handle_adjust_correction preserved (original called it twice with the same shape; keeps mock-based test timing identical). * refactor(detection): decompose learn_from_detections into 4 phase helpers PR 5.3 from the audit follow-up. HIGH-risk per the audit notes because pattern memory persists across episodes -- a regression would only surface on the NEXT episode encountering a similar ad. Held to behavior-byte-identical: same signature, same control flow, every log call preserved verbatim, same active_pattern_sponsors preload position. src/ad_detector.py: learn_from_detections shrank 151 -> 52 LOC. Logic moved into four private methods on AdDetector: - _ad_passes_learning_filters(ad, min_confidence): was_cut + stage + confidence floor + long-ad stricter threshold (34 LOC). - _resolve_sponsor_for_learning(ad): 4-tier sponsor resolution + canonicalization (37 LOC). - _sponsor_blocked_by_gates(sponsor, active_pattern_sponsors): Gate A (prefix-of-known) + Gate B (unknown short single word) (34 LOC). - _create_pattern_and_fingerprint(...): create_pattern_from_ad + optional fingerprint store + per-ad try/except (41 LOC). The orchestrator now reads as the audit-described pipeline: filter -> resolve sponsor -> gates -> create. Verification: 1177 pytest passed (baseline) -> 1177 passed (after); test_ad_detector_learn_from_detections.py 7/7; test_ad_detector.py 30/30. No log line changed, no exception swallow widened. * refactor(processing): decompose four pipeline-stage helpers PR 5.4 from the audit follow-up. HIGH-risk per the audit because these are the actual production ad-removal pipeline; a regression would surface as wrong cuts in real episodes, not as a unit-test failure. Held to behavior-byte-identical: public signatures unchanged, every audio_logger call preserved verbatim, lazy-import pattern preserved, component-resolution order unchanged (helpers take db / storage / ad_detector explicitly rather than re-calling _get_components per helper). src/main_app/processing.py: - _run_ad_reviewer 95 -> 67 LOC. Extracted _apply_reviewer_verdict_to_ad, _merge_reviewer_result. - _refine_and_validate 124 -> 67 LOC. Extracted _refine_boundaries, _apply_heuristic_rolls, _load_user_corrections, _gate_validation_by_confidence. - _finalize_episode 109 -> 26 LOC. Extracted _persist_episode_state, _refresh_rss_for_slug, _log_completion_summary, _record_history_and_event. Big drop because the function was almost entirely sequential side effects with no shared local state between phases. - _run_verification_pass 157 -> 93 LOC. Extracted _apply_pass2_heuristic_rolls, _validate_verification_ads, _gate_verification_ads_by_confidence, _recut_processed_audio. Did not push under 90 because the outer try/except wrap + nested control flow doesn't benefit from further splitting. Verification: pytest at 1177 passed after each individual function decomposition (four checkpoints) and 1177 passed after the final state. No log line text changed; lazy imports preserved. * refactor(processing): EpisodeContext dataclass for pipeline plumbing PR 5.5 from the audit follow-up. Holds the immutable per-episode context shared across the detection pipeline so each stage helper can take one object instead of a long parameter list. Mutable plumbing (progress_callback, cancel_event, audio_analysis, audio_path, skip_patterns, segments) stays explicit because it isn't "context". New file src/main_app/episode_context.py: - @DataClass(frozen=True) EpisodeContext with: slug, episode_id, podcast_name, episode_title, podcast_id, podcast_description, episode_description, podcast_tags. src/ad_detector.py: - process_transcript gains a keyword-only `ctx=None` parameter that unpacks into the existing locals when supplied. Old positional signature preserved so api/episodes.py:838 doesn't need to change. - When ctx is provided, podcast_id for pattern-scoping is set from ctx.slug (matches the long-standing call convention of passing slug-as-podcast_id; ad_patterns.podcast_id is TEXT, not the integer FK, see prior commit's note). src/main_app/processing.py: - process_episode builds ctx once after stage-2 audio analysis and threads it through the migrated helpers. Podcast-row + tags lookup consolidated to a single site (was inside _detect_ads_first_pass). - _detect_ads_first_pass: 11 -> 6 params. - _run_verification_pass: 13 -> 7 params. - _apply_pass2_reviewer: 12 -> 6 params. Out of scope by design: _run_ad_reviewer / _build_episode_meta (uses integer DB PK podcast_id and depends on a separate podcast_row fetch right before the call -- moving it would shift the DB read upstream, which the "Don't dismiss race warnings" feedback memo cautions against). _process_episode_background (9 params; all are external IO forwarded to process_episode, not pipeline-internal). Verification: 1177 pytest passed (baseline) -> 1177 passed (after). Diff: 3 files, +96 / -33 LOC, well under the 400-LOC scope cap. * refactor: phase-6 structural splits (ad_detector, schema, AdReviewModal) Three big-file decompositions from the audit. All HIGH-risk per the plan. Strategy was pure code-motion only -- git mv for the central renames so blame survives, no behavior change, no signature or log-line edits, no SQL DDL byte changes. PR 6.1: src/ad_detector.py -> src/ad_detector/ package. - __init__.py 1403 LOC: AdDetector class + explicit named re-exports covering all 28 public names tests and production import (audited via `grep "from ad_detector import"` repo-wide, including the underscored ones tests reach into directly: _NON_BRAND_WORDS, _extract_ad_keywords, _find_keyword_region, _unpack_region, _text_has_ad_content, _find_json_array_candidates). - boundaries.py 857 LOC: refine/extend/snap/merge/validate/dedupe helpers + transition phrases. - prompts.py 393 LOC: USER_PROMPT_TEMPLATE, create_windows, format_window_prompt, get_static_system_prompt, parse_ads_from_response. - Central file dropped 2575 -> 1403 LOC (-45%). Net +78 LOC from added module docstrings and duplicated imports. - Git records: R src/ad_detector.py -> src/ad_detector/__init__.py. PR 6.2: src/database/schema.py -> src/database/schema/ package. EXTREMELY HIGH risk per CLAUDE.md "never lose data on migrations". Pure code motion only. SHA-256 of SCHEMA_SQL and MIGRATION_INDEXES_SQL identical before/after the split (13aac307..., a73d3d14...). No MIGRATIONS registry exists in this codebase to reorder; the migration sequence is encoded in a single 770-line _run_schema_migrations method that was not touched. Only the two SQL DDL string constants moved out to tables.py (306 LOC); re-imported from __init__.py for back-compat. SchemaMixin class stays in __init__.py. - Migration-focused tests (test_migration_sponsor_fk, test_only_expose_processed_migration, test_settings_migration, test_database): 92 passed. PR 6.3: frontend/src/components/AdReviewModal.tsx decomposition. 3 of 6 audit-planned extractions: - frontend/src/utils/adReviewHelpers.ts: state-free helpers (parseTimeInput, formatTime, getThemeWaveformColors, loadPlayWhileDragging, savePlayWhileDragging, PLAY_WHILE_DRAG_KEY). - frontend/src/components/ad-editor/Pin.tsx: Pin subcomponent + PinProps interface. Self-contained, owns its own dragging state. MIN_AD_DURATION=1.0 duplicated locally to avoid a circular import; comment in Pin.tsx flags the two-site invariant. - frontend/src/components/ad-editor/usePeaks.ts: peaks fetch/cancel effect + the three peaks state vars. setPeaks(null) on resetView dropped because the hook already clears+re-fetches on resetTick. - AdReviewModal.tsx: 1654 -> 1448 LOC (-12.5%). Default export and named exports unchanged. Scope-limited per the audit's "if extraction causes subtle reactivity changes, revert" guard: - useWavesurfer NOT extracted -- wavesurfer mount deliberately omits adStart/adEnd from deps (eslint-disable); wrapping would either reintroduce deps (changes reactivity) or require ref-passing that complicates the surface. - AdReviewToolbar + Cursor + usePinDrag NOT extracted -- prop counts per component would balloon to 8-15 for marginal LOC savings. Verification: 1177 pytest passed, frontend tsc clean. * refactor(main_app): drop _get_components tuples in favor of direct imports PR 6.4 from the audit. Four files were doing positional unpacking of a 4-10 element tuple returned from a local _get_components() helper: - src/main_app/processing.py:35 10-tuple, 13 call sites (mostly _,_,_,...) - src/main_app/routes.py:67 4-tuple - src/main_app/background.py:13 5-tuple - src/main_app/feeds.py:31 5-tuple Tuple-reorder was a silent failure mode -- every caller used positional unpacks, so a reorder in one place would mis-bind globals everywhere. All four files now do `from main_app import db, storage, ...` at the top and reference singletons directly; the _get_components helpers are deleted. No circular import: singletons are instantiated in main_app/__init__.py:146-159 before the explicit submodule imports at lines 616-622, so any `import main_app.<sub>` triggers a full __init__.py run-through first and the names are bound. Test fallout fixed in the same commit: - tests/unit/test_feed_304_refresh.py: 2 tests were patching main_app.feeds._get_components; repointed to patch the individual singletons (db, rss_parser, storage, status_service, pattern_service) on main_app.feeds. - tests/unit/test_head_request.py: 8 tests were patching on the main_app package namespace; once routes.py binds those names locally via `from main_app import ...`, patches at the package level don't affect them. Repointed to main_app.routes.{...}. Verification: pytest at 1177 passed after each of the four file conversions (four checkpoints) and at the final state. Net -32 LOC. * fix(ci): unbreak eslint static-components + codeql py/reflective-xss Two CI failures from PR 232: 1. eslint react-hooks/static-components (9 errors in StatsPage.tsx). The Phase 7+8 makeSortTh factory returned a component during render, which the rule rejects because per-render component identity churn forces React to remount every <SortTh /> on each state change. Reverted SortTh to a stable module-level function component that takes sortField/sortDir/onSort as explicit props. The eight call sites pass them through; the win the audit asked for (no per-instance click handler rebuilding via closure factory) is now achieved via React's normal prop equality. 2. CodeQL py/reflective-xss (HIGH) flagged src/api/patterns.py:1065 inside _validate_import_request after the Phase 5.2 decomposition. The path was a false positive (json_response uses jsonify which HTML-escapes), but CodeQL's interprocedural analysis lost line-of-sight when patterns/mode moved behind a helper that received the raw `data` dict. Hoisted the data.get() extraction back to the caller; the validator now takes already-extracted (patterns, mode) args, so the request -> validate -> error_response flow is visible inline at import_patterns. _IMPORT_MODES tuple constant replaces the inline literal list (one source of truth for the accepted values). Verification: 1177 pytest passed, frontend tsc + eslint clean. * fix(ci): restore Layout render-phase pathname compare + inline mode validation Second round of CI fixes on PR 232. 1. Layout.tsx -- react-compiler eslint rule rejects setState inside useEffect ('Calling setState synchronously within an effect can trigger cascading renders'). The Phase 3 change replaced the original render-phase lastPath compare with the standard useEffect([location.pathname]) pattern. The React docs explicitly recommend the render-phase idiom for 'reset state on prop change' (https://react.dev/learn/you-might-not-need-an-effect#adjusting- some-state-when-a-prop-changes), and the new lint rule enforces that recommendation. Restored the original pattern; dropped the now-unused useEffect import. 2. src/api/patterns.py -- CodeQL py/reflective-xss still tripped on the helper-returned 'mode' even after hoisting data.get() to the caller. CodeQL's interprocedural analysis won't trust a narrowing that happens inside a called function. Replaced the helper-side mode validation with an inline if/elif/else chain at the request boundary that binds the value to one of three literal strings. That gives CodeQL a direct line-of-sight from request.get_json() to a bound check it can prove. _validate_import_request still takes mode as an arg but no longer makes its own membership check (the caller already did). Verification: 1177 pytest passed, frontend tsc + eslint clean. * fix(ci): inline the import-patterns response paths so CodeQL sees the literal mode Third CI fix. CodeQL py/reflective-xss kept tripping on the empty-patterns no-op response inside _validate_import_request, which echoed `{'mode': mode, ...}`. mode is constrained to one of three literal strings by the inline if/elif/else in the caller, but CodeQL's interprocedural analysis won't trust that narrowing across a helper boundary -- it sees a tainted value flowing into a response body. Moved the no-op response and the empty-replace 400 directly into import_patterns. The helper now only computes a boolean predicate (_is_empty_replace_request). All response construction happens at the caller, where mode is provably one of 'merge' / 'replace' / 'supplement' from the inline narrowing earlier in the function. The data flow is now linear and visible: data.get('mode') -> bound check -> response body. Verification: 1177 pytest passed. * chore: bump to 2.4.9 for the audit-cleanup release Release notes consolidated under [2.4.9] in CHANGELOG.md. The release ships the security HIGH fixes (session rotation, fail-loud secret key, CSRF on logout, JSON CSP, community_sync safe_http, entrypoint hardening) plus 50+ structural refactors held to behavior-byte-identical. Full pytest 1177 + frontend tsc clean at every commit boundary on this branch. openapi.yaml info.version bumped in sync with version.py per the project rule. * test(smoke): phase B - cover the three 2.4.9 security additions Three new local smoke tests that exercise the security HIGHs landed earlier in this PR. Verified against the 2.4.9 image (ttlequals0/minuspod:2.4.9) running as the smoke container. T19-session-rotation (3/3 passing locally) - Seed a session cookie on an unauth GET /auth/status. - Login carrying that cookie. - Assert the post-login Set-Cookie session= value differs from the pre-login one. Flask's signed cookies are deterministic for the same session payload, so a non-rotated session would produce identical strings; difference proves session.clear() ran before the re-init. Also verifies the new cookie successfully authenticates a follow-up /auth/status. T20-logout-csrf (5/5 passing locally) - POST /auth/logout without X-CSRF-Token -> 403 (rejected). - Authenticated state preserved across the rejected attempt. - POST /auth/logout with X-CSRF-Token -> 200. - Authenticated state cleared after the successful logout. - Unauthenticated POST /auth/logout still returns 200 (the "always callable to clear stale state" property is preserved; csrf.validate returns None when session.authenticated is False). T21-json-csp (4/4 passing locally) - GET JSON endpoint includes Content-Security-Policy: default-src 'none'; frame-ancestors 'none'. - HTML /ui/ CSP still present (catches regressions in the text/html branch caused by the JSON addition). - JSON error response (POST /auth/login with empty body) also ships the locked-down CSP. Wired into run-all.sh's local sequence between 18-multi-worker and 17-shutdown (17 runs last because it kills the container). Not covered here (deferred by design): - SECRET_KEY fail-loud at startup: hard to exercise from outside the container without sabotaging the data dir; covered better by a unit test against get_or_create_secret_key directly. * test(smoke): phase C - pattern system end-to-end + per-test IP isolation Five new local smoke tests covering the pattern endpoints that were previously untouched (~21 of 30 uncovered routes). Verified against the 2.4.9 image (ttlequals0/minuspod:2.4.9). All 8 Phase B+C tests pass 44/44 assertions in a clean sequential run. T22-pattern-import-modes (4/4) - merge / supplement modes round-trip. - replace with empty array -> 400 (fat-finger guard). - invalid mode -> 400. T23-pattern-readonly (10/10) - /patterns/stats, /patterns/health, /patterns/contaminated return well-shaped JSON. - /patterns?scope=, /patterns?source=, /patterns?active=false filter without erroring. - /patterns/deduplicate and /patterns/backfill-false-positives return 200 (idempotent on a clean DB). T24-pattern-crud (9/9) - Per-pattern CRUD: import -> GET <id> -> PUT update -> verify persisted. GET on missing id -> 404. - bulk-disable + bulk-delete with their expected_count guard and the without-confirm 400 path. T25-pattern-community (3/3) - /patterns/preview-export returns ready/rejected shape. - /patterns/submit-bundle returns a JSON bundle. - GET /community-patterns/sync-status returns 200. - (/community-patterns/all is DELETE-only and destructive; not exercised here.) T26-pattern-edge-cases (6/6) - /patterns/<id>/split on a non-contaminated pattern -> 400. - /patterns/merge with empty source_ids -> 400. - /patterns/<id>/protect on a local-source pattern -> 400 (community-only). - protect on missing id -> 404. - unprotect on a local pattern -> 200 (intentionally idempotent; no source guard on the DELETE path). - submit-to-community on missing id -> 400 (the export pipeline raises ExportError before any 404 lookup). Per-test IP isolation (smoke/lib/common.sh) - New smoke_ip helper returns a per-test, per-run unique private IPv4 (10.<test>.<derived>.<derived>) for spoofing in X-Forwarded-For. /patterns/import has a 3/hour per-IP rate limit and /auth/login a 3/min per-IP limit; reusing the same IP across runs starves the quota and tests fail for the wrong reason. - login() helper extended with an optional 4th X-FF arg; all of T19-T26 now log in via a per-test unique IP. Smoke covers (after this commit): - T19 session rotation, T20 CSRF on /auth/logout, T21 JSON CSP (Phase B, 2.4.9 security additions). - T22-T26 pattern system surface (Phase C). - Pre-existing T14 still tests the export -> reimport happy path. Run-all sequence updated to include 22-26 between 21-json-csp and 17-shutdown so 17 stays last (it kills the container). * refactor(smoke): apply /simplify findings to phase B+C tests Three review agents (reuse / quality / efficiency) ran against the Phase B+C smoke tests and surfaced ~200 LOC of duplication plus dead code in T22. All 8 tests still pass 44/44 after the refactor. New helpers in smoke/lib/common.sh - setup_authed_jar <test_num>: create jar, login from a per-run unique X-FF IP, grab CSRF, register an EXIT trap so the jar is cleaned up even when an assertion aborts the script. Exports JAR and CSRF globals. - auth_json <method> <path> <test_num> <body>: authenticated JSON-body request returning the response body on stdout. Bundles Content-Type + X-Forwarded-For + X-CSRF-Token in one call. - auth_code <method> <path> <test_num> [body]: same shape but returns the HTTP code instead of the body. Used when the assertion is "expect this status". - json_get <key>: read stdin, print the top-level JSON field null-safely. Replaces ~12 inline `python3 -c 'import json,sys; print(json.load(sys.stdin).get(...))'` blocks. - find_pattern_id_by_sponsor <sponsor>: look up the test pattern's id after import (the import endpoint returns counts only, not ids). Replaces 4 identical 8-line filter-by-sponsor blocks. - import_test_pattern <test_num> <sponsor> <text>: build the payload, import with mode=merge, return the assigned id. Pairs with bulk_delete_pattern for setup/teardown. - bulk_delete_pattern <test_num> <id>: cleanup helper satisfying the bulk-delete expected_count fat-finger guard. - header_value <name>: case-insensitive header extractor from a curl -i dump. Replaces 3x awk pipelines in T21. Dead code removed - smoke/local/22-pattern-import-modes.sh: deleted post_import and post_import_body helpers. Both were defined-but-unused; the caller at the empty-replace path invoked post_import_body with curl-style flags as positional args, then immediately overwrote the result with a fresh inline curl ("the wrapper above swallows curl args; redo cleanly"). The first invocation also burned a rate-limit slot for nothing. Replaced with auth_code helper. Assert tightening - T26 split-on-non-contaminated: was assert_in "400 404"; the pattern was just imported so 404 would mask a regression. Tightened to assert_eq 400. - T26 submit-to-community on missing id: same; the handler returns 400 with ExportError reasons, never 404. Tightened to assert_eq 400 with a comment explaining why. Test files dropped 696 -> 424 LOC (-272). common.sh +137 LOC for the helpers. Net -135 LOC, plus the duplication that made each new test feel like the previous one is gone. Verification: 8/8 tests still PASS, 44/44 assertions, in a clean sequential run against ttlequals0/minuspod:2.4.9. * test(smoke): phase D - settings system coverage (T27-T30) Four new local smoke tests covering the settings system end-to-end: 12/12 tests now pass 74/74 assertions in a clean sequential run against ttlequals0/minuspod:2.4.9. T27-settings-readonly (10/10) - GET /settings, /settings/models, /settings/whisper-models, /settings/retention, /settings/audio, /settings/processing-timeouts, /settings/reviewer, /settings/community-sync, /settings/webhooks, /networks return well-shaped JSON. T28-settings-roundtrip (8/8) - PUT /settings/retention (with negative-value 400 rejection). - PUT /settings/audio (both keepOriginalAudio values; missing-key 400). - PUT /settings/processing-timeouts (both values; missing-key 400). - Uses the new poll_for helper to handle 2-worker memory:// cache staleness (a PUT invalidates one worker's cache; a follow-up GET can land on the other worker). T29-settings-resets (4/4) - POST /settings/ad-detection/reset (idempotent on second call). - POST /settings/prompts/reset. - GET /settings still returns 200 after both resets. T30-webhook-crud (8/8) - POST /settings/webhooks/validate-template (happy + malformed). - POST /settings/webhooks creates with valid event name from VALID_EVENTS ("Episode Processed"). - PUT /settings/webhooks/<id> updates events. - GET /settings/webhooks lists the new hook. - POST /settings/webhooks/<id>/test returns {success: bool} (delivery to example.invalid will fail safely; we just check call shape). - DELETE /settings/webhooks/<id> + confirm gone. New helper in smoke/lib/common.sh - poll_for <endpoint> <key> <expected> [tries]: GET an endpoint up to N times, returning early when json_get(key) == expected. Promoted from a local T28 helper because future settings/community- sync round-trips and any other cache-backed endpoint will hit the same multi-worker staleness window. Run-all sequence updated to include 27-30 between 26 and 17-shutdown. /simplify review applied: - poll_for promoted to common.sh. - T29's raw curl for /settings shape check replaced with auth_code. - T30's /test endpoint call now goes through auth_json (was missing X-Forwarded-For and skipping per-IP rate-limit rotation). - T30's pass/fail control flow restructured from && / || chain to explicit if/else for clarity. - T28 has a comment explaining the True/False string coupling to json_get's Python `print(bool)` output. Found one real API contract discovery during test development: - Webhook events use title-case spaced names ("Episode Processed", "Auth Failure", "Episode Failed") per src/webhook_service.py VALID_EVENTS, not snake_case. Tests use the correct values. * test(smoke): phases E, F, G - episode/feed/search/history/docs/SSE Six new local smoke tests bringing Phase B+C+D+E+F+G to 18 tests covering ~70 API routes. All 18 pass clean against ttlequals0/minuspod:2.4.9 running locally. Phase E - episode + feed surface T31-episode-error-paths (15/15) - 7 per-episode GETs on missing slug/id -> 404/400. - 4 per-episode mutations (reprocess, regenerate-chapters, cancel, retry-ad-detection) on missing -> 404/400. - alias POST /episodes/<slug>/<id>/reprocess -> 404/400. - bulk endpoint with empty payload -> 400/404. - 2 feed-level mutations on missing slug -> 404/400. T32-feed-listing (4/4) - GET /feeds returns list shape (empty on fresh DB). - GET /episodes/processing returns 200. - POST /feeds/refresh no-op on empty fleet. - GET /feeds/export-opml returns 200. Phase F - search / history / docs / liveness T33-search (4/4) - GET /search?q=... returns 200. - GET /search/stats returns dict shape. - POST /search/rebuild idempotent. - /podcast-search accepts 200 / 4xx / 5xx (smoke env has no Podcast Index key, so accept either configured or unconfigured behavior). T34-history (3/3) - GET /history list shape. - GET /history/stats returns 200. - GET /history/export returns 200 (downloadable blob). T35-docs-and-liveness (6/6) - /api/v1/docs and /docs/ Swagger UI 200/redirect. - /openapi.yaml first line is `openapi:`, info.version is SemVer. - /health/live cheap liveness probe. - /api/v1/health deep readiness probe reports status=healthy. Phase G - SSE T36-sse-stream (3/3) - Unauthenticated /status/stream emits `event: auth-failed` and the server closes the stream (verified via curl --max-time 3s). - Authenticated /status/stream emits the initial `data:` snapshot. - Authenticated stream does NOT emit auth-failed. Notable fix during T35 development: piping a 170+ KB openapi.yaml through `head -1 | grep` races head's pipe-close vs printf's write and triggers SIGPIPE 141, which set -o pipefail turns into a false- negative assertion failure. Replaced the pipe with `curl -o file` plus pure-bash parameter expansion and `awk` field extraction. Run-all sequence updated to include 31-36 between 30 and 17-shutdown. * fix(smoke): auto X-FF on login + checkpoint WAL before docker cp + lockout pacing + XXE log regex Final fixes after running the full smoke suite end-to-end. All 36 tests (T01-T36) plus 00-setup and 99-teardown now PASS, 0 FAIL, 2 SKIP (T08 artwork + T12 RSS public paths, both by design on a fresh DB without fixture data). smoke/lib/common.sh - login() auto-X-FF - Default xff (when caller passes none) is now `smoke_ip <test_num>` derived from TEST_NAME. Without this, the cumulative /auth/login calls from 30+ tests all hit the loopback IP and the 10/hour per-IP cap kicks in around test 11, making every downstream test fail with 401 instead of the assertion it's trying to verify. smoke/local/13-backup.sh - WAL checkpoint - The provider_crypto_salt that the backup endpoint persists may still be buffered in podcast.db-wal at docker-cp time. Now run `python3 -c sqlite3.PRAGMA wal_checkpoint(FULL)` via docker exec before copying so the salt is in the main DB file (the runtime image has python3 but not the sqlite3 CLI). smoke/local/11-lockout.sh - 65s gap before 6th attempt - The lockout test paces 5 wrong-password attempts 22s apart, then fires the 6th to verify the 429+Retry-After response. The cumulative 5 attempts fit flask-limiter's 3/min sliding window, but the 6th attempt was getting 429 from the limiter (which doesn't include Retry-After) before the route handler's lockout check could run. Added a 65s sleep before the 6th attempt so the limiter window clears and the lockout path actually fires. smoke/local/07-xxe.sh - accept SSRF-blocked log line - The smoke XXE feed is hosted at <gateway>:18074 (a non-standard port). The SSRF validator blocks non-standard ports before the XML parser ever sees the body, so the defusedxml log line never fires. The defense is still working (verified by /etc/passwd absence), so the regex now accepts "SSRF blocked" alongside "xml forbidden construct" as evidence of rejection. End-to-end suite run: 36 PASS, 0 FAIL, 2 SKIP. 100% pass rate against ttlequals0/minuspod:audit-cleanup running locally. * fix(settings/ui): disable Test on unsaved drafts + explain why input clears after Save Two UX fixes to ProviderKeyField (shared by LLM Provider section AND the Transcription whisper-API key) that close a confusion path real users have hit: 1. Test was clickable on a typed-but-unsaved draft. The /providers/<provider>/test endpoint reads the SAVED key from the DB, not the React draft state. Clicking Test before Save returned "no key configured" even though the input clearly held a value, leading users to file "field not set" bug reports. Now: when `draft.length > 0 && status.source !== 'db'`, the Test button is disabled and carries a title tooltip: "Click Save first -- Test reads the saved key, not the unsaved draft." 2. After Save the input cleared with no explanation. The field is wiped post-save because the key is stored encrypted and we don't re-display secrets. The placeholder ("(stored - enter new value to change)") explained this but was easy to miss, so the blank input looked like "save erased my key" -- the other common user report. Now: a green inline "Saved -- input cleared because keys are stored encrypted" notice appears next to the action buttons for 4s on successful save. Verified both with Playwright against the local 2.4.9 image: - openrouter (status.source='none') with typed draft -> Test disabled with tooltip text matching the fix. - After Save -> input clears, "Stored encrypted" chip appears, green notice "Saved -- input cleared because keys are stored encrypted" renders. Subsequent Test runs reach the backend (returns connection failed against the fake endpoint, as expected). The fix is a one-file component change; LLM provider keys (anthropic, openrouter, openai, ollama) and the whisper API key all use the same ProviderKeyField component and get the new behavior uniformly. * fix(settings): close GH-234 -- Base URL Test failure, vanishing Save Changes, hydration/compare mismatch Three real bugs from issue #234 ("Unable to set LLM Provider base URL in the UI"). Verified end-to-end with Playwright after each fix. 1. Test reports "base URL not set" even after the user saves ----------------------------------------------------------- Root cause: the "Save" button next to the API key only hit PUT /providers/<name> with apiKey; the Base URL field's value was parked in React state, saved only by the global Save Changes button at the top of the page. A user who clicked Save next to the key (the obvious target) had the key persisted but no base URL, so the next Test failed with "base URL not configured". Fix: handleProviderKeySave now includes the current openaiBaseUrl (or whisperApiConfig.baseUrl for the whisper provider) in the same PUT body. One Save covers both. Settings.tsx:94-105. 2. Save Changes button vanishes after a successful save ----------------------------------------------------------- Root cause: get_effective_base_url / get_effective_provider read through a 5-second TTL cache (src/llm_client.py:_provider_cache). Settings PUT writes the DB but did NOT invalidate this cache. The post-mutation GET /settings refetch fired within the 5s window, returned the pre-write value, the UI re-hydrated state to the stale value, hasChanges flipped back to false, and the Save Changes button disappeared. Fix: new invalidate_provider_cache() helper in llm_client.py; called from _apply_provider_fields and both /providers/<name> PUT/DELETE handlers. 3. Save Changes always visible on a fresh load ----------------------------------------------------------- Root cause (discovered while verifying #2): the hydration block used `|| ''` for prompt/model fields while the compare block used `|| settings.defaults.X`. When the server stored an empty string and the default was non-empty (e.g. claudeModel saved as "" but default "claude-sonnet-4-5..."), hasChanges was true on load with no edits. Fix: aligned every compare line in computeChangedFields with the exact fallback used in the hydration block. Verification (Playwright against locally-built 2.4.9): - Save next to API key sends {apiKey, baseUrl}; Test now reaches the backend and returns "connection failed" against fake endpoint. - Post-save refetch sees fresh value; hasChanges flips false; Save Changes hides correctly. - Fresh load: Save Changes HIDDEN (no spurious dirty state). - Editing only Base URL -> Save Changes APPEARS. Frontend bundle rebuild on the next image build; static/ui is gitignored.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
benchmarks/llm/src/benchmark/report.py(_render_pareto,_render_precision_recall_chart,_render_token_efficiency_chart) were assigning colors withcmap = plt.get_cmap("tab20")thencmap(i % 20). With more than 20 models in a sweep, the legend cycled colors so multiple models shared the same dot color. The current report has 28 models, so the Cost-vs-F1 chart had several visibly duplicate colors._distinct_colors(n)that concatenatestab20 + tab20b + tab20c(60 categorical colors with good perceptual contrast) and falls back to evenly-spacedhsvpast that, so every model gets a unique color up to 60 and remains unique past 60. Magic number is derived fromlen(palette), not hardcoded.colors = _distinct_colors(len(points))andcolors[i].pareto.svg,precision_recall.svg,token_efficiency.svg). Other charts color by value semantics (threshold bands, heatmaps, p50/p90/p99 categories), not model identity, so they were not affected and were not regenerated to keep this diff focused.Test plan
_distinct_colors(n)returnsnunique RGBA tuples forn in {5, 30, 60, 80, 200}(60 from categorical, >60 from hsv).benchmark report; SVGs render and the three target scatters show distinct colors per model./simplifyand/code-reviewran clean against the diff (no findings above the 80-confidence threshold).